Background and Rationale

Singapore ranks first in the world for diabetes-induced kidney failure


The figure on the left-hand side shows the percentage of diabetic kidney failure patients out of all kidney failure cases.

According to the NKF (National Kidney Foundation), 5.7 patients in Singapore are diagnosed on a daily basis on average, making this a critical public health challenge for society.

Beyond its healthcare impact, a large population with kidney failure is also a huge financial burden for the country: $190 million is spent annually on dialysis treatment.

Moreover, up to 1 million diabetic patients are predicted to be diagnosed by 2050.

Rationale

We propose a random forest model that serves as a prototype tool for our target users - laboratory diagnosticians and doctors - to decide whether a person is suffering from renal failure, as well as the stage of his/her illness. Once the professional inputs the required values for assessing a person's health, the result is displayed on screen immediately. If put into use, it is estimated that many people could benefit from our application, not counting those who have not yet been diagnosed with renal failure.

  • Singapore’s ageing population is a key factor behind this high rate of kidney failure.
  • Given the many data points available, a data-driven approach is feasible.

Data Cleaning

There were many missing pieces of data scattered throughout the dataset.


Before using the data for exploratory data analysis or model training, the missing values shown on the left must be dealt with.

We have a lot of missing data (especially in the rbc, rc, and wc columns). Since at least 25% of the data is missing in these variables, it would likely be better if these variables were left out altogether.

Furthermore, we also have missing age data. Since age appeared to be independent of most variables (i.e., not correlated with them in any way), it would also likely be safer to remove entries whose age is missing (i.e., one cannot predict a person's age solely from the data provided).
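As a sketch of this cleaning step (assuming the raw data sits in a data frame named ckd, and using dplyr - both assumptions on our part):

```r
library(dplyr)

# Drop the three variables with >= 25% missingness,
# then drop the rows where age itself is missing
ckd_clean <- ckd %>%
  select(-rbc, -rc, -wc) %>%
  filter(!is.na(age))
```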

Using mi for model-based imputations


We note that it would be unwise to remove much more data, as there are only 400 data points in the entire dataset. Nevertheless, we also realized that most of the missing values could not be imputed in more “conventional” ways (e.g., hot-/cold-deck imputation and median/mean imputation).

The latter was especially the case for categorical variables. Hence, we decided to use the mi package from CRAN to impute missing values via chained regression analysis (for numerical data) and thereafter use these results to predict the class of a data point (for categorical data).
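A minimal sketch of this workflow with the mi package (assuming the cleaned data frame is named ckd_clean; object names are ours):

```r
library(mi)

# Wrap the data frame so mi can track variable types and missingness
mdf <- missing_data.frame(as.data.frame(ckd_clean))

# Heatmap of where data are missing (the dark spots)
image(mdf)

# Run the chained regressions: 4 parallel chains, 80 iterations each
imputations <- mi(mdf, n.chains = 4, n.iter = 80)

# Extract one completed dataset for downstream analysis
ckd_complete <- complete(imputations, m = 1)
```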

The heatmap to the left shows where missing data is present (i.e., the dark spots).

Revisiting the missingness map after imputation


Based on trial and error, we ultimately settled on 80 iterations (i.e., calculations) per regression chain. While some imputed values’ means differed somewhat between chains, we found that 80 iterations gave the best result (i.e., only one variable had highly variable means within each chain).
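To check whether the chains had settled, one can compare per-chain means and the R-hat statistics that mi reports (a sketch, assuming the mi fit is stored in an object named imputations):

```r
library(mi)

# Mean of each variable within each chain; a large spread across
# chains for a variable suggests its imputations have not converged
round(mipply(imputations, mean, to.matrix = TRUE), 3)

# R-hat values close to 1 indicate the chains agree
Rhats(imputations)
```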

Further examining the imputed values - what does the distribution of imputed data look like relative to the original, non-missing data?


mi also generates several plots for each variable that has missing data. An example is shown: the histograms to the left compare the distributions of the observed, imputed, and completed data against the values predicted by mi’s regression chain (i.e., the red line).

The middle graphs show that the actual values of the data points predicted by mi are similar to non-missing data points.

The rightmost graphs are residual plots, suggesting that the values predicted by mi are still within reason.
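These diagnostics come from mi's built-in plotting method (a sketch, assuming the fit is stored in imputations):

```r
# One page of diagnostics per variable with missing data:
# histograms of observed/imputed/completed values, fitted-vs-observed
# comparisons, and residual plots
plot(imputations)
```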

Exploratory Data Analysis

Are some variables more correlated with one another?


This was one of the first things we did with the data: examining it for multicollinearity.

Based on the correlation plot itself, we note that several variable pairs are highly correlated.

Hence, moving forward, we should be mindful of these variables lest we end up using them in the same model.
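A sketch of how such a check might be performed (assuming the completed data sits in ckd_complete; the corrplot package is our choice here, not necessarily the one used):

```r
library(corrplot)

# Pairwise correlations among the numeric variables only
num_vars <- Filter(is.numeric, ckd_complete)
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")

# Visualise; |r| near 1 off the diagonal flags collinear pairs
corrplot(corr_mat, method = "circle", type = "upper")
```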

How do co-morbidities present themselves in healthy and diseased individuals?


We see that healthy individuals are indeed in good health - good appetite and none of the mentioned co-morbidities.

We can exploit this feature, knowing that no person with kidney disease is free from hypertension or some other co-morbidity!
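One quick way to verify this from the data is to cross-tabulate the class label against each co-morbidity (a sketch, assuming the completed data frame ckd_complete; htn is the hypertension indicator in this dataset):

```r
# Counts of hypertension status within each class; the Healthy row
# should contain no "yes" entries if the observation above holds
table(ckd_complete$classification, ckd_complete$htn)
```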

Possible feature engineering using age ranges?


Out of curiosity, we also engineered a new feature, agerange, that we thought could be useful during model training. We defined three age categories as follows (based on commonly accepted standards in the US):

  1. Elderly => \(\ge\) 65
  2. Middle age => \(>\) 45 and \(<\) 65
  3. Young => \(\le\) 45

The number of healthy individuals appears to increase with decreasing age - could we then use this as a feature in a classifier (i.e., lower age = higher probability of being healthy)?
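The agerange feature above can be derived with cut() (a sketch on a hypothetical ckd_complete data frame; how the boundary cases at exactly 45 and 65 are binned is our choice):

```r
# Bin age into the three categories; default right-closed intervals
# give Young = (-Inf, 45], Middle age = (45, 65], Elderly = (65, Inf)
ckd_complete$agerange <- cut(
  ckd_complete$age,
  breaks = c(-Inf, 45, 65, Inf),
  labels = c("Young", "Middle age", "Elderly")
)
```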

Model Building

Creating testing and training sets of data

spec_tbl_df [391 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ age           : num [1:391] 48 7 62 48 51 60 68 24 52 53 ...
 $ bp            : num [1:391] 80 50 80 70 80 ...
 $ sg            : num [1:391] 1.02 1.02 1.01 1 1.01 ...
 $ al            : Factor w/ 6 levels "0","1","2","3",..: 2 5 3 5 3 4 1 3 4 3 ...
 $ su            : num [1:391] 0 0 3 0 0 0 0 4 0 0 ...
 $ rbc           : Factor w/ 2 levels "abnormal","normal": 2 2 2 2 2 1 1 2 2 1 ...
 $ pc            : Factor w/ 2 levels "abnormal","normal": 2 2 2 1 2 2 2 1 1 1 ...
 $ pcc           : Factor w/ 2 levels "absent","present": 1 1 1 2 1 1 1 1 2 2 ...
 $ ba            : Factor w/ 2 levels "absent","present": 1 1 1 1 1 1 1 1 1 1 ...
 $ bgr           : num [1:391] 121 171 423 117 106 ...
 $ bu            : num [1:391] 36 18 53 56 26 25 54 31 60 107 ...
 $ sc            : num [1:391] 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
 $ sod           : num [1:391] 117 107 121 111 105 ...
 $ pot           : num [1:391] -4.64 0.478 -3.167 2.5 1.901 ...
 $ hemo          : num [1:391] 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
 $ pcv           : num [1:391] 44 38 31 32 35 39 36 44 33 29 ...
 $ htn           : Factor w/ 2 levels "no","yes": 2 1 1 2 1 2 1 1 2 2 ...
 $ dm            : Factor w/ 2 levels "no","yes": 2 1 2 1 1 2 1 2 2 2 ...
 $ cad           : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ appet         : Factor w/ 2 levels "good","poor": 1 1 2 2 1 1 1 1 1 2 ...
 $ pe            : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 2 1 1 ...
 $ ane           : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 1 2 2 ...
 $ classification: Factor w/ 2 levels "Diseased","Healthy": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   age = col_double(),
  ..   bp = col_double(),
  ..   sg = col_double(),
  ..   al = col_double(),
  ..   su = col_double(),
  ..   rbc = col_character(),
  ..   pc = col_character(),
  ..   pcc = col_character(),
  ..   ba = col_character(),
  ..   bgr = col_double(),
  ..   bu = col_double(),
  ..   sc = col_double(),
  ..   sod = col_double(),
  ..   pot = col_double(),
  ..   hemo = col_double(),
  ..   pcv = col_double(),
  ..   htn = col_character(),
  ..   dm = col_character(),
  ..   cad = col_character(),
  ..   appet = col_character(),
  ..   pe = col_character(),
  ..   ane = col_character(),
  ..   classification = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

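A sketch of how the split might be made (the 70/30 ratio and the seed are our assumptions; ckd_complete is the imputed dataset shown above):

```r
set.seed(42)  # for reproducibility (our choice of seed)

# Random 70/30 train/test split of the 391 rows
n <- nrow(ckd_complete)
train_idx <- sample(n, size = round(0.7 * n))
train <- ckd_complete[train_idx, ]
test  <- ckd_complete[-train_idx, ]
```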

Constructing an initial random forest model and evaluating its performance
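A sketch of fitting and evaluating such a forest with the randomForest package (hyperparameters are illustrative defaults; train and test come from the split above):

```r
library(randomForest)

# Fit a random forest predicting the class label from all other variables
rf <- randomForest(classification ~ ., data = train,
                   ntree = 500, importance = TRUE)

# Evaluate on the held-out test set via a confusion matrix
pred <- predict(rf, newdata = test)
table(Predicted = pred, Actual = test$classification)

# Which variables matter most?
varImpPlot(rf)
```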


Applications

A trivial application to predict a patient’s kidney health status.

Shiny applications not supported in static R Markdown documents


Some commentary here…
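A minimal Shiny sketch of such a front-end (the input names and the trained rf model object are hypothetical; a real app would expose every predictor):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Kidney health status prediction"),
  numericInput("age", "Age", value = 50),
  numericInput("bp", "Blood pressure", value = 80),
  # ... the remaining lab values would be added similarly ...
  actionButton("go", "Predict"),
  textOutput("result")
)

server <- function(input, output) {
  output$result <- renderText({
    req(input$go)  # wait until the button is pressed
    newdata <- data.frame(age = input$age, bp = input$bp)
    # rf is a previously trained model; a real app needs all predictors
    as.character(predict(rf, newdata = newdata))
  })
}

shinyApp(ui, server)
```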

A RESTful API service.
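Such a service could be built with the plumber package, for example (a sketch; the endpoint name and parameters are ours, and rf is the hypothetical trained model):

```r
# api.R -- serve with: plumber::pr("api.R") |> plumber::pr_run(port = 8000)
library(plumber)

#* Predict kidney health status from lab values
#* @param age:numeric Patient age
#* @param bp:numeric Blood pressure
#* @post /predict
function(age, bp) {
  newdata <- data.frame(age = as.numeric(age), bp = as.numeric(bp))
  # rf is the trained random forest; a real service needs all predictors
  list(prediction = as.character(predict(rf, newdata = newdata)))
}
```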